Intro to Data Visualization with Matplotlib

Why is data visualization important?

“A picture is worth a thousand words”

Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner – and you can experiment with different scenarios by making slight adjustments.

What is Matplotlib?

The matplotlib Python library, developed by John Hunter and many other contributors, is used to create high-quality graphs, charts, and figures. The library is extensive and capable of changing very minute details of a figure.

Some basic concepts and functions provided in matplotlib are

Figure and axes

The entire illustration is called a figure and each plot on it is an axes (do not confuse Axes with Axis). The figure can be thought of as a canvas on which several plots can be drawn. We obtain the figure and the axes using the subplots() function

Plotting

The very first thing required to plot a graph is data. A dictionary of key-value pairs can be declared, with keys and values as the x and y values. After that, scatter(), bar(), and pie(), along with tons of other functions, can be used to create the plot.

Axis

The figure and axes obtained using subplots() can be used for modification. Properties of the x-axis and y-axis (labels, minimum and maximum values, etc.) can be changed using Axes.set()

Anatomy of a Figure

A matplotlib visualization is a figure onto which is attached one or more axes. Each axes has a horizontal (x) axis and vertical (y) axis, and the data is encoded using color and glyphs such as markers (for example circles) or lines or polygons (called patches). The figure below annotates these parts of a visualization and was created by Nicolas P. Rougier using matplotlib. The source code can be found in the matplotlib documentation. anatomy

Basic Charts / Graphs

  • Line Plot

  • Histograms

  • Scatter Plot

  • Box plots

  • Bar Plots

  • Pie Chart

Take a look at data types

Import Library

import matplotlib.pyplot as plt # plt is the convention, also known as nickname 

Line Plot

  • Line Graph: a graph that shows information that is connected in some way (such as change over time)

# <---Fake Data for Plotting---->
# Median ages 
ages = [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]

# Median Microbiologist Salaries by Age
mib_salary = [38496, 42000, 46752, 49320, 53200, 56000, 62316, 64928, 67317, 68748, 73752]

# Median Pharmacist Salaries by Age
pharma_salary = [45372, 48876, 53850, 57287, 63016,65998, 70003, 70000, 71496, 75370, 83640]

# Median Cader Salaries by Age
bcs_salary = [37810, 43515, 46823, 49293, 53437,56373, 62375, 66674, 68745, 68746, 74583]

Create Simple Line Plot

plt.plot(ages, mib_salary)
plt.show() 
../_images/MPL01-Intro to Data Visualization_11_0.png

Add Title, Labels

plt.plot(ages, mib_salary)
plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.show() 
../_images/MPL01-Intro to Data Visualization_13_0.png

Comparison

# Microbiology
plt.plot(ages, mib_salary)
# Pharmacy
plt.plot(ages, pharma_salary)
plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.show() 
../_images/MPL01-Intro to Data Visualization_15_0.png

Add Legend

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology")
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy")
# BCS
plt.plot(ages, bcs_salary, label= "BCS")

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.show() 
../_images/MPL01-Intro to Data Visualization_17_0.png

Legend Position

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology")
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy")
# BCS
plt.plot(ages, bcs_salary, label= "BCS")

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend(loc="best") 
plt.show() 
../_images/MPL01-Intro to Data Visualization_19_0.png

Nicely fit in the figure

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology")
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy")
# BCS
plt.plot(ages, bcs_salary, label= "BCS")

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.show() 
../_images/MPL01-Intro to Data Visualization_21_0.png

Customization

SEE MORE: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html

  • Color

  • Marker

  • Marker Size

  • Line Style

  • Line width

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.show() 
../_images/MPL01-Intro to Data Visualization_23_0.png

Hex Color

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology", color='#444444')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy")
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linestyle='--')

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.show() 
../_images/MPL01-Intro to Data Visualization_25_0.png

Font Size

# Microbiology
plt.plot(ages, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

plt.xlabel('Ages', fontsize=12)
plt.ylabel('Median Salary (USD)', fontsize=12)
plt.title('Median Salary (USD) by Age', fontsize=12)
plt.legend() 
plt.tight_layout()
plt.show() 
../_images/MPL01-Intro to Data Visualization_27_0.png

Figure Size

plt.figure(figsize=(10,6))
# Microbiology
plt.plot(ages, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.show() 
../_images/MPL01-Intro to Data Visualization_29_0.png

Add Style or Themes

# Available styles 
plt.style.available
plt.figure(figsize=(10,6))
# plt.style.use("fivethirtyeight")
# Microbiology
plt.plot(age, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.show() 

Exporting Figure & Put it Together

plt.figure(figsize=(10,6))
# plt.style.use("fivethirtyeight")
# Microbiology
plt.plot(ages, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
plt.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
plt.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.legend() 
plt.tight_layout()
plt.savefig("median_salary.pdf") # .pdf, .png, .jpg, .jepg, .tiff
plt.show() 
../_images/MPL01-Intro to Data Visualization_34_0.png

Subplots

In fiction, a subplot is a secondary strand of the plot that is a supporting side story for any story or the main plot. Subplots may connect to main plots, in either time and place or in thematic significance. Subplots often involve supporting characters, those besides the protagonist or antagonist.

fig, ax = plt.subplots(nrows=2, ncols=3)
print(ax)
[[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff633127a90>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7ff633193090>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7ff6331c75d0>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff6332e8e10>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7ff63337f5d0>
  <matplotlib.axes._subplots.AxesSubplot object at 0x7ff6335340d0>]]
../_images/MPL01-Intro to Data Visualization_36_1.png
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
../_images/MPL01-Intro to Data Visualization_37_0.png
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)  
# plt.style.use("fivethirtyeight")
# Microbiology
ax1.plot(ages, mib_salary, label="Microbiology",  color="b", linewidth=2, marker='o')
# Pharmacy
ax1.plot(ages, pharma_salary, label= "Pharmacy", color="red", linewidth=3, marker='x')
# BCS
ax2.plot(ages, bcs_salary, label= "BCS", linewidth=4, linestyle='--')

ax1.set_xlabel('Ages')
ax1.set_ylabel('Median Salary (USD)')
ax1.set_title('Median Salary (USD) by Age')
ax1.legend() 

ax2.set_xlabel('Ages')
ax2.set_ylabel('Median Salary (USD)')
ax2.set_title('Median Salary (USD) by Age')
ax2.legend() 

plt.tight_layout()
plt.savefig("median_salary.pdf") # .pdf, .png, .jpg, .jepg, .tiff
plt.show() 
../_images/MPL01-Intro to Data Visualization_38_0.png

Histograms

  • Histogram: a graphical display of data using bars of different heights.

  • The height of each bar shows how many fall into each range.

  • And you decide what ranges to use! hist

Create Histogram

import random
weight = [random.random() for i in range(20)]
plt.hist(weight) 
plt.xlabel("Weight")
plt.ylabel("Frequency")
plt.title("Distribution of Weight")
plt.show() 
../_images/MPL01-Intro to Data Visualization_42_0.png
plt.hist(weight, bins=15) 
plt.xlabel("Weight")
plt.ylabel("Frequency")
plt.title("Distribution of Weight")
plt.show() 
../_images/MPL01-Intro to Data Visualization_43_0.png

Scatter Plot

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

# Height in(m)
height = [1.4, 1.2, 1.5, 1.3, 1.6, 1.5]
# Weight in (Kg) 
weight = [60, 15, 85, 74, 77, 65]
plt.scatter(weight, height)
plt.xlabel("Weight")
plt.ylabel("Weight") 
plt.title("Scatter Plot of Height & Weight") 
plt.tight_layout() 
plt.show() 
../_images/MPL01-Intro to Data Visualization_46_0.png

Box Plot

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are

import random
collection = [random.random() for i in range(10)]
plt.boxplot(collection)
plt.show() 
../_images/MPL01-Intro to Data Visualization_48_0.png

Side by Side Box Plot

import random
collectn_1 = [random.random() for i in range(10)]
collectn_2 = [random.random() for i in range(15)]
collectn_3 = [random.random() for i in range(20)]
collectn_4 = [random.random() for i in range(15)]
collections = [collectn_1, collectn_2, collectn_3, collectn_4]
plt.boxplot(collections)
plt.show() 
../_images/MPL01-Intro to Data Visualization_50_0.png

Bar Plot

# <---Fake Data for Plotting---->
# Gender 
gender = ["Female", "Male", "Female", "Male", "Male", "Female"]

# Age group 
age_group = ["Adult", "Child", "Adult", "Adult","Adult","Elderly"]

# Height in(m)
height = [1.4, 1.2, 1.5, 1.3, 1.6, 1.5]

# Weight in (Kg) 
weight = [60, 15, 85, 74, 77, 65]
plt.bar(gender, height)
plt.xlabel("Gender")
plt.ylabel("Height") 
plt.title("Example Bar Graph")
plt.show() 
../_images/MPL01-Intro to Data Visualization_53_0.png
plt.bar(age_group, height)
plt.xlabel("Age Group")
plt.ylabel("Height") 
plt.title("Example Bar Graph")
plt.show() 
../_images/MPL01-Intro to Data Visualization_54_0.png

Histograms vs Bar Graphs

  • Bar Graphs are good when your data is in categories (such as “Comedy”, “Drama”, etc).

  • But when you have continuous data (such as a person’s height) then use a Histogram.

  • It is best to leave gaps between the bars of a Bar Graph, so it doesn’t look like a Histogram. hist

Pie Chart

  • Pie Chart: a special chart that uses “pie slices” to show relative sizes of data.

  • It is a really good way to show relative sizes: it is easy to see which movie types are most liked, and which are least liked, at a glance. pie

Create Simple Pie Chart

# <---Fake Data for Plotting---->
labels = ["A", "T", "G", "C", "N"]
sizes = [1057, 1184, 2089, 1267, 89]
plt.pie(sizes, labels=labels)
plt.show() 
../_images/MPL01-Intro to Data Visualization_59_0.png

Add Percentage

plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.show() 
../_images/MPL01-Intro to Data Visualization_61_0.png

Add Colors

colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.show() 
../_images/MPL01-Intro to Data Visualization_63_0.png

Shadow

plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, shadow=True)
plt.show() 
../_images/MPL01-Intro to Data Visualization_65_0.png

Explode

explode = (0, 0, 0.1, 0, 0)   # only "explode" the 3rdslice (i.e. 'Hogs')
plt.figure(figsize=(10,6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, shadow=True, startangle=90, explode=explode)
plt.legend(loc="best")
plt.tight_layout() 
plt.show() 
../_images/MPL01-Intro to Data Visualization_68_0.png